Web Crawler with browser_cookie3
Web Crawler with browser_cookie3
— PTT Gossiping Example
This guide demonstrates how to use browser_cookie3
to retrieve cookies from your browser and use them in a web crawler to access the PTT Gossiping board, which requires a over18=1
cookie to bypass the age restriction prompt.
🔧 Requirements
Install the required libraries:
pip install requests beautifulsoup4 browser-cookie3
🔍 What is browser_cookie3
?
browser_cookie3
is a Python library that allows you to programmatically access cookies stored by your browser (Chrome, Firefox, etc.). It can be useful when accessing websites that require session or consent cookies.
📁 Sample Crawler Code
import requests
import browser_cookie3
from bs4 import BeautifulSoup
def _fetch_cookies_from_browser() -> dict[str, str]:
"""Fetches PTT cookies (like over18=1) from your Chrome browser."""
cookie_jar = browser_cookie3.chrome(domain_name="ptt.cc")
local_cookie_dict = requests.utils.dict_from_cookiejar(cookie_jar)
# Confirming if the essential cookie exists
cookies = {
"over18": local_cookie_dict.get("over18", "1")
}
print("[Fetched Cookies]", cookies)
return cookies
def crawl_ptt_gossiping():
url = "https://www.ptt.cc/bbs/Gossiping/index.html"
cookies = _fetch_cookies_from_browser()
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
}
response = requests.get(url, headers=headers, cookies=cookies)
if response.status_code != 200:
print(f"Failed to retrieve page, status: {response.status_code}")
return
soup = BeautifulSoup(response.text, "html.parser")
articles = soup.select("div.title a")
print("\n🔥 Latest Posts on PTT Gossiping:")
for article in articles:
print(f"- {article.text.strip()} ({article['href']})")
if __name__ == "__main__":
crawl_ptt_gossiping()
💡 Notes
- If you’ve never visited PTT Gossiping before, make sure to open the page once in Chrome and click "我同意,我已年滿十八歲" to set the
over18=1
cookie. - You can extend this crawler to scrape post content, authors, dates, or write the data into a CSV/DB.